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Abstract 

Land cover classification using multispectral satellite image is a very challenging task with nu- 
merous practical applications. We propose a multi-stage classifier that involves fuzzy rule extraction 
from the training data and then generation of a possibilistic label vector for each pixel using the 
fuzzy rule base. To exploit the spatial correlation of land cover types we propose four different 
information aggregation methods which use the possibilistic class label of a pixel and those of its 
eight spatial neighbors for making the final classification decision. Three of the aggregation methods 
use Dempster-Shafer theory of evidence while the remaining one is modeled after the fuzzy k-NN 
rule. The proposed methods are tested with two benchmark seven channel satellite images and the 
results are found to be quite satisfactory. They are also compared with a Markov random field 
(MRF) model-based contextual classification method and found to perform consistently better. 
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1 Introduction 

Land cover classification in remotely sensed images is considered to be a cost effective and reliable 
method for generating up-to-date land cover information [1]. Usually such images are captured by mul- 
tispectral scanners (such as Landsat TM) , that acquire data at several distinct spectral bands producing 
multispectral images. Automated analysis of such data calls for sophisticated techniques for data fusion 
and pattern recognition. Most widely used techniques include statistical modeling involving discrimi- 
nant analysis and maximum likelihood classification and artificial neural networks-based approaches [2] , 
[3] , [4] , [5] . However, these classifiers usually classify a sample to the class for which maximum support 
is obtained, no matter how small this amount of support may be or there may be another class for 
which the support is very close to the maximum. This particular feature is often criticized [1] in context 
of their applicability in land cover classification. "Soft" classifiers can be useful for such problems as 
they can produce a measure of confidence in support of the decision as well as indicate measures of 
confidence in support of alternative decisions, which can be used for further processing using auxiliary 
information. This can result in a more robust and accurate system. In developing soft classifiers for 
land cover analysis two approaches have gained popularity. These are based on (l)fuzzy set theory [7] 
and (2) Dempster and Shafer's evidence theory [6]. There is, of course, the probabilistic approach, that 
we do not pursue further. 

Different fuzzy methodologies for land cover classification in multispectral satellite images have been 
investigated by various researchers. For example, [1] uses fuzzy c-means algorithm, while Kumar et 
al. [9] applied a fuzzy integral method. Fuzzy rulebased systems have been used for classification by 
many researchers [7, 8] for diverse fields of application. Fuzzy rules are attractive because they are 
interpretable and can provide an analyst a deeper insight into the problem. Use of fuzzy rulebased 
systems for land cover analysis is a relatively new approach. Recently Bardossy and Samaniego [10] 
have proposed a scheme for developing a fuzzy rulebased classifier for analysis of multispectral images, 
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where the randomly generated initial rules are fine-tuned by simulated annealing. In [11] Kulkarni and 
McCaslin used fuzzy neural networks for rule extraction. 

The other approach to design soft classifiers use the evidence theory developed by Dempster and Shafer 
[6], [12], [13]. As observed by Lee et al. in [14], for multispectral image analysis there may be a great 
incentive for applying Dempster-Shafer theory of evidence. Since the theory of evidence allows one to 
combine evidences obtained from diverse sources of information in support of a hypothesis, it seems a 
natural candidate for analyzing multispectral images for land cover classification [15], [16], [17]. In all 
these works the approach is to treat each channel image as a separate source of information. Each image 
is analyzed to associate each pixel with some degree of belief pertaining to its belonging to each member 
of a set of hypotheses known as the frame of discernment. Usually some probabilistic techniques are 
employed to assign the degree of belief. In the next stage these belief values from all images for a pixel 
are combined using Dempster's rule [6] to calculate the total support for each hypothesis. In a recent 
paper [18] Jouan and Allard used evidence theory for combining information from multiple sources for 
land use mapping. 

In a satellite image, usually the landcover classes form spatial clusters, i.e., a pixel belonging to a 
particular class is more likely to have neighboring pixels from the same class rather than from other 
classes. Thus, inclusion of contextual information from the neighboring pixels is likely to increase 
classification accuracy. An overview of common contextual pattern recognition methods can be found 
in [19]. In this paper we propose several schemes for classifier design that uses both fuzzy sets theory 
and Dempster-Shafer evidence theory. These are two stage schemes. In the first stage a fuzzy rulcbased 
classifier is developed using a small training set. This classifier is noncontextual and for each pixel 
it generates a possibilistic label vector. In the next stage we aggregate the responses of the fuzzy 
rules over a 3 x 3 spatial neighborhood of a pixel to make the classification decision about that pixel. 
Thus, the decision making process takes into consideration the information available from the spatial 
neighborhood of the pixel. Here we propose four methods for contextual decision making. In the 
experimental results we compare the performance of the contextual classifiers with the fuzzy rulcbased 
noncontextual classifiers designed in the first stage. We also provide a comparison of the performance of 
the proposed classification schemes with a recent Markov random field (MRF) model-based contextual 
classification scheme proposed by Sarkar et al. [20]. 



2 Designing the Fuzzy Rule base 

We use a multi-stage scheme for designing fuzzy rule based classifiers. The set of gray values correspond- 
ing to a pixel in the channel images is used as the feature vector for that pixel. In the first stage a set of 
labeled prototypes representing the distribution of the training data is generated using a Self-organizing 
Feature Map (SOFM) [23] based algorithm developed in [24]. The algorithm dynamically decides the 
number of prototypes based on the training data. Then each of these prototypes is converted to a fuzzy 
rule. 

Next the fuzzy rules are tuned by modifying the peaks as well as the spreads of the fuzzy sets associated 
with the rules. The tuned rules can readily be used to classify unknown samples based on the firing 
strengths of the rules. Typically, a test sample is classified to the class of the rule generating the highest 
firing strength. For the sake of completeness we first give a brief description of SOFM. 



2.1 Kohonen's SOFM algorithm 

SOFM is formed of neurons placed on a regular (usually) ID or 2D grid. Thus each neuron is identified 
with a index corresponding to its position in the grid (the viewing plane) . Each neuron i is represented 
by a weight vector w, £ 3? p where p is the dimensionality of the input space. In t-th training step, a data 
point x 6 W is presented to the network. The winner node with index r is selected as r ~ arg min{||x — 

i 

Wj ;t _i||}. w r! i_i and the other weight vectors associated with cells in the spatial neighborhood N t (r) 
of r are updated using the rule: 

Wi,( = w M _i + a(t)h ri (t)(x - Wi,t-i), 



2 



where a(t) is the learning rate and h r i(t) is the neighborhood kernel (usually Gaussian). The learning 
rate and the radius of the neighborhood kernel decrease monotonically with time. During the iterative 
training the SOFM behaves like a flexible net that folds onto the "cloud" formed by the input data. A 
trained SOFM exhibits remarkable and useful properties of topology preservation and density matching 
and often used for visualization of metric-topological relationships and distributional density properties 
of training data X — {x 1; Xjv} in 5R P through their mapping onto the viewing plane. From clustering 
viewpoint, SOFM has the advantage of avoiding under-utilization of the prototypes. 

2.2 Generation of Prototypes 

First we train a one dimensional SOFM with c nodes, where c is the number of classes. We do so 
because the smallest number of rules that may be required is equal to the number of classes. At the 
end of the training the weight vector distribution of the SOFM reflects the distribution of the input 
data. Then the training data is divided into a c partition according to their closeness to c weight 
vectors, V° = {v", v°, v°} C 3? p . Each of the prototypes is labeled based on the class information of 
corresponding partition using majority voting. However, such a set of prototypes may not classify the 
data satisfactorily because the class information is not used in the training resulting in the possibility 
that a prototype may represent data from more than one class significantly. 

We use the prototype refinement scheme described in [24]. The basic idea behind this refinement 
algorithm is that a useful prototype should satisfy two criteria: (i) it should represent an adequate 
number of points, (ii) only one of the classes should be strongly represented by it. 

The prototype refinement scheme applies four operations, deletion, modification, merging and splitting on 
the set of prototypes while trying to fulfill the above conditions. In the training data X — {x l7 ...,xjv} 
let there be Nj points from class j. The refinement stage uses just two parameters K\ and K 2 to 
dynamically generate c + 1 retention thresholds known as a global retention threshold a and a set of 
class- wise retention threshold 0k (one for each class), to evaluate the performance of each prototype, a 
and (3k are computed dynamically (not fixed) for the t-th iteration using the following formulae: 

a* = (K, l^- 1 !)" 1 and (3\ = (K 2 l^l) -1 , 

where V£ _1 = {v 4 | v, G V* -1 , Cj = k}. 

Here V l ~ x is the set of prototypes obtained after (t — 1) iterations of the algorithm. To consider a 
prototype useful it must represent more than a l N training points. Further, a prototype must represent 
more than (3j.Nk points from class k to be considered a (potential) prototype for class k. 

Finally, the set of prototypes is again refined by SOFM algorithm with winner-only update strategy. 
After a few iterations this algorithm produces a set of adequate number of prototypes that represents 
the training data much better than the initial one. For details the readers are referred to [24]. The final 
set of prototypes yf mal = {v{ maZ , • • • ,w{ mal ], where c > c, is used to generate the set of initial fuzzy 
rules. 

2.3 Designing fuzzy rulebase 

A prototype Vj represents a cluster of points for class k. This cluster can be described by a fuzzy rule 
of the form: Rf If x is CLOSE TO Vj then class is k. This rule can be further translated into : 

Ri-. If xi is CLOSE TO v a AND • • • AND x p is CLOSE TO v lp then class is k. 

Note that, this is just one possible interpretation of "x is CLOSE TO Vj". The first form requires a 
multidimensional membership function while the second form requires several one dimensional member- 
ship functions and a conjunction operator. In general, the two forms will not produce the same output. 
Depending on the choice of the membership function and the conjunction operator, the forms may lead 
to the same output. But, since the representation by the atomic clauses is a plausible realization of the 
multidimensional form, the performance of the systems built using either form is not going to be much 
different. 
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The fuzzy set CLOSE TO Vij can be modeled by triangular, trapezoidal or Gaussian membership 
function. In this investigation, we use the Gaussian membership function, 

HijfajiVij^ij) = exp(-(xj - Vij) 2 1 'ui j 2 ). 

Given a data point x with unknown class, we first find the firing strength of each rule. Let a^(x) denote 
the firing strength of the i th rule on a data point x. We assign the point x to class k if a r = max,(ai(x)) 
and the r th rule represents class k. 

The performance of the classifier depends crucially on the adequacy of the number of rules used and 
proper choice of the membership functions. In our case each fuzzy set is characterized by two parameters 
and Oij. Let the initial set of fuzzy rules be R° — | i = 1,2,. . . , c}. The parameters and <7?- 

for fuzzy sets in the antecedent part of a rule R- £ R° are obtained from the prototype v / maZ £ V i%nal 
as follows: 

v% = v{j nal and a% = k w {^( ]T (x kj - ^ iW ) 2 )/|X i |, 

where X t is the set of training data closest to v { ma( and k w > is a constant parameter that controls 
the initial width of the membership function. If k w is small, then specificity of the fuzzy sets defining the 
rules will be high and hence each rule will model a small area in the input space. On the other hand, a 
high k w will make each rule cover a bigger area. Since the spreads are tuned, in principle k w should not 
have much impact on the final performance, but in practice the value of k w may have a significant impact 
on the classification performance for complicated data sets because of the local minimum problem of 
gradient descent techniques. In the current work the values of k w s are found experimentally. One can 
use a validation set for this. 

The initial rulebase R° thus obtained can be further fine tuned to achieve better performance. But 
the exact tuning algorithm depends on the conjunction operator (implementing AND operation for the 
antecedent part) used for computation of the firing strengths. The firing strength can be calculated 
using any T-norm [7]. Use of different T-norms results in different classifiers. The product and the 
minimum are the most popular choices of T-norms. 

Though it is much easier to formulate a calculus based tuning algorithm if product is used, its use is 
conceptually somewhat unattractive. To illustrate the point let us consider a rule having two atomic 
clauses in its antecedent. If the two clauses have truth values a and 6, then intuitively the antecedent is 
satisfied at least to the extent of min(a, b). However, if product is used as the conjunction operator, we 
always have ab < min(a, b). Thus we always under-determine the importance of the rule. This does not 
cause any problem for non-classifier fuzzy systems because the denazification operator usually performs 
some kind of normalization with respect to the firing strength. But in classifier type applications a 
decision may appear to be made with a very low confidence, when actually it is not the case. In certain 
cases, such a situation may lead to overemphasis on total ignorance under evidence theory framework. 
Thus to avoid the use of the product and at the same time to be able to apply calculus to derive update 
rules we use a soft-min operator. 

The soft-match of n positive number xi,x 2 , ...,a;„ is defined by 

SM( Xl ,x 2 ,...,x n ,q) = <^ 2_ nLK . 

where q is any real number. SM is known as an aggregation operator with upper bound of value 1 
when Xi £ [0, 1] Vi. This operator is used by different authors [25, 26] for different purposes. It is easy to 
see that as q — > — oo the soft-match operator behaves like "min" operator. Thus we define the softmin 
operator as the soft match operator with a sufficiently negative value of the parameter q. The firing 
strength of the r-th rule computed using softmin is 

f X) 7 = 1 (Mrj { x j'i v rj, °V.j)) 9 1 

Qr (x) = j 1 . 

At the value q = —10 its behavior resembles very closely to the min operator, so we use q = —10.0. A 
very low value of q may lead to numerical instability in the computation. 
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2.4 Tuning of the Rule Base 



Let x£lbc from class c and R c be the rule from class c giving the maximum firing strength a c for x. 
Also let R^ c be the rule from the incorrect classes having the highest firing strength a^ c for x. We use 
the error function E, 

E=Y / ( 1 - a c + a^ c ) 2 . (1) 

This kind of error function has been used in [29]. We minimize E with respect to v c j, v~, c j and o~ c j, 
a^ c j of the two rules R c and i?-, c using gradient decent. Here the index j corresponds to clause number 
in the corresponding rule. The tuning process is repeated until the rate of decrement in E becomes 
negligible resulting in the rule base R* mal . 

Since a Gaussian membership function is extended to infinity, for any data point all rules will be fired to 
some extent. In our implementation, if the firing strength is less than a threshold, e (rs 0.01), then the 
rule is not assumed to be fired. The threshold is set considering approximate 2a limit of the Gaussian 
membership functions. Thus, under this situation, the rulebase extracted by the system may not be 
complete with respect to the training data. This can happen even when we use membership functions 
with triangular or trapezoidal shapes. This is not a limitation but a distinct advantage, although for 
the data sets we used, we did not encounter such a situation. If no rule is fired by a data point, then 
that point can be thought of as an outlier. If such a situation occurs for some test data, then that will 
indicate an observation not close enough to the training data and consequently no conclusion should be 
made about such test points. 



3 Decision making with aggregation of spatial contextual in- 
formation 

In this section we describe four decision making schemes. The first one involves a simple aggregation 
of the fuzzy label vectors of the neighboring pixels. The scheme is modeled after the well known fuzzy 
k-nearest neighbor algorithm proposed by Keller et al. [27]. The other three methods use Dempster- 
Shafer theory of evidence. For the sake of completeness we give a brief introduction of Dempster-Shafer 
theory of evidence before we describe the decision making methods in detail. 



3.1 Dempster-Shafer theory of evidence 

Let 6 be the universal set and P(0) be its power set. A function to : P(0) — > [0, 1] is called a Basic 
Probability Assignment (BPA) whenever m(0) = and XMce m (^) = 1- Here m(A) is interpreted as 
the degree of evidence supporting the claim that the "truth" is in A and in absence of further evidence 
no more specific statement can be made. Every set A £ -P(0) for which m(A) > is called a focal 
element of to. Evidence obtained in the same context from two distinct sources and expressed by two 
BPAs to 1 and to 2 on the same power set P(0) can be combined by Dempster's rule of combination to 
obtain a joint BPA to 1 ' 2 as: 

if A = 



Reve K = J2 BnC=0 m 1 (B)m 2 (C). 

Equation (2) is often expressed with the notation m 1 ' 2 = to 1 © to 2 . The rule is commutative and 
associative. Evidence from any number (say k) of distinct sources can be combined by repetitive 
application of the rule as to = to 1 © to 2 © • • • © m k — ©* =1 m l . 

To select the optimal decision from the evidence embodied in a BPA, typically we construct a probability 
function P e on 6. This is done through a transformation known as pignistic transformation [28]. The 
pignistic probability for 9 £ can be expressed in terms of BPAs as follows: 

r»= E r§ (3) 

ACe,0£A ' 
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Optimal decision can now be chosen in favor of 9q with the highest pignistic probability. 

3.2 Aggregation of context information for decision making 

Let there be c classes, C={Ci, C2, • • • C c }. For each pixel P the fuzzy rule base generates a fuzzy label 
vector a. £ 5R C such that the value of the fc-th component of a, ak (0 < < 1) represents the confidence 
of the rule base supporting the fact that the pixel P belongs to class Ck- Strictly speaking, a £ 5R C is a 
possibilistic label vector [7]. The value of ctk is computed as follows: 

Let r be the number of rules in the fuzzy rulebase. Since c < r, there could be multiple rules corre- 
sponding to a class. Then a fe is the highest firing strength produced by the rules corresponding to the 
class Cfc for pixel P. We treat this value as the confidence measure of the rulebase pertaining to the 
membership of pixel P to the class Ck- However, if ak is less than a threshold, say 0.01, it is set to 0. 

In our decision making schemes we consider a pixel together with the pixels within its 3 x 3 spatial 
neighborhood. We identify the pixel p under consideration as p° and its eight neighbors as p 1 ,p 2 , ■ ■ ■ p s . 
The corresponding possibilistic label vectors assigned to these nine pixels are denoted as a , a 1 , • • • a 8 . 

3.2.1 Method 1: Aggregation of possibilistic labels 

This is a simple aggregation scheme modeled after the fuzzy k-NN method of Keller et al. [27] . Given 
a set of nine label vectors {a , a 1 , • • • a. 8 } an aggregated possibilistic label vector a A is computed as 
follows: 

a A = L» = o a _ (4) 
9 v ; 

The pixel p° is assigned to class Ck, such that a A > otf Mi = 1, 2, • • • c. Note that, though the label 
vector a corresponds to the pixel to be classified, no special emphasis (importance) is given to it in 
this aggregation scheme. 

3.2.2 Method 2: Aggregation with Bayesian belief modeling 

This aggregation method as well as the next two use the evidence theoretic framework of Dempster- 
Shafcr. Within this framework the set of classes, C is identified as the frame of discernment. A body of 
evidence is represented by a BPA m over subsets of C. The value m(A) denotes the belief in support of 
the proposition that the true class label of the pixel of interest is in A C C. In context of our problem 
we shall (1) identify distinct bodies of evidence (BOE), (2) formulate a realistic scheme of assigning 
the BPAs to the relevant subsets of C, (3) combine evidences provided by all BOEs, (4) compute the 
pignistic probability for each class from the combined evidence, and (5) make a decision using the 
maximum pignistic probability principle. 

In this scheme we identify eight BOEs for eight neighbor pixels with corresponding BPAs denoted 
as m 1 , m 2 , • • • , to 8 . We consider the Bayesian belief structure, i.e., each focal element has only one 
element. Assigning BPA to a subset essentially involves committing some portion of belief in favor of 
the proposition represented by the subset. So the scheme followed for assigning BPAs must reflect some 
realistic assessment of the information available in favor of the proposition. We define m l as follows: 

m i {{C k })= K + a ° fc) ,fc = l,2,...,c (5) 

where S — X)fc=i( a fc + a k) ^ s a normalizing factor. Thus, each BPA contains c focal elements, one 
corresponding to each class and the assigned value m l ({Cfc}) is influenced by the magnitudes of firing 
strengths produced by the rule base in support of class Ck for the pixel of interest p° and its z-th neighbor 
p % . Clearly the label vector a influences all eight BPAs. Hence it is expected that in the final decision 
making, the influence of a will be higher than any neighbor. Such an assignment is motivated by the 
fact, the spatial neighbors usually are highly correlated, i.e., pixel p° and its immediate neighbors are 
expected from the same class. 

Thus for eight neighboring pixels we obtain eight separate BPAs. These BPAs are combined to get the 
global BPA applying the Dempster's rule repeatedly. It can be easily seen that the global BPA ra G is 
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also Bayesian and can be computed as 

"°<w>- A w ' ({ %L - < 6 » 

£l=iIL=i m*({Ci}) 

In this case the pignistic probability P c (Ck) is the same as m G ({Cfc}). So the pixel p° is assigned to 
class C k such that m G ({C fc }) > m G ({C ; }) VC* ; e C. 



3.2.3 Method 3: Aggregation with non-Bayesian belief 

Here also the BOEs are identified in the same way as in the previous method. However, we allow the 
assignment of belief to subsets of C having two elements i.e., the subsets {C;,C m : I < m and l,m = 
1, 2, • • • , c}. We define m l as follows: 

m'({C fc })= K + a ° fc) ,fc = l,2,...,c (7) 

mr~< n 7 IN { a l + a m) + ( a m + a< l) /o\ 

m l {{C u C m :l<m}) = ±-i m ' K ^, (8) 

l,m = 1, 2, c, where 

/c— c I — c— 1 m—c 

s = 5>« + a° fe ) + E E t(«i + «™) + ( a ™ + «?)]■ 

fe=l /=1 m=i + l 

In Method 2 we exploited only the correlation of spatial neighbors. However, for satellite images when 
a pixel falls at the boundary of some land cover type, it may correspond to more than one land cover 
type. Since the chance of a pixel to have representation from more than two land cover types is not 
high, we restrict the cardinality of the focal sets to two. Using (8) for eight neighboring pixels we obtain 
eight separate bodies of evidence. The global BPA for the focal elements is then computed applying 
Dempster's rule repeatedly. In our previous method we could use Eq. (6) to compute the global BPA 
since we dealt with singleton focal elements only. In the present case, we have to compute the global 
BPA in steps, combining two bodies of evidence at a time and preparing an intermediate BPA which 
will be combined with another BPA and so on. Once the global BPA is computed, the class label is 
assigned according to highest pignistic probability. 



3.2.4 Method 4: Aggregation using evidence theoretic k-NN rule 

This method is fashioned after the evidence theoretic k-NN method [12]. Here nine BPAs m°, to 1 , • • • , m 8 
are identified corresponding to the possibilistic label vectors a , a 1 , • • • , a. 8 . The BPA m % is assigned as 
follows: Let q = argmax{a l k ), then 

k 

™*({C q }) = ^ (9) 

m\C) = \-a\ (10) 

m\A) = 0VAeP(C)\{C,{C q }} (11) 



Thus there is only one focal element containing one element in each BOE. The rest of the belief is 
assigned to the frame of discernment C, which can be interpreted as the amount of ignorance. The 
evidences are then combined using Dempster's rule and the class label is assigned according to highest 
pignistic probability. 

Like Method 1 in this method also no special emphasis is given to the pixel under consideration, p°, over 
the neighboring pixels. Whereas in Method 2 and Method 3, a influences all the BPAs i.e., ot° plays 
a special role. It can be seen later in experimental results that these two approaches provide different 
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classification efficiencies depending on the nature of the spatial distribution of the classes. Intuitively, 
if pixels corresponding to different land cover types are scattered all over the image, neighboring pixels 
may not be given the same importance as that of the central pixel for optimal performance of the 
classifier. On the other hand, if pixels of a particular land cover type form spatial clusters, then giving 
equal importance to the neighboring pixels may be desirable. However, the optimal weight to be given 
to the neighboring pixels to get the best performance should depend on the distribution of different 
land cover types on the image being analyzed. To realize this we modify the BPA assignment scheme 
of current method as follows: Let q = argmax{d l k ), then 

k 

m°({C q }) = a° q , for i = (12) 

m *({C<z}) = wa q , otherwise (13) 

m°(C) = 1 -a°, for i = (14) 

m l (C) = 1 — wa q , otherwise (15) 

m\A) = 0VAeP(C)\{C,{C q }} (16) 

where < w < 1 is a weight factor that controls the contribution of the neighboring pixels in the 
decision making process. The optimal value of w for best classification performance depends on the 
image under investigation and can be learnt during training using grid search. 



4 Experimental results and discussions 

We report the performances of the proposed classifiers for two multispectral satellite images. We call 
them Satimagel and Satimage2. Satimagel is a Landsat-TM image available along with full ground 
truth in the catalog of sample images of the ERDAS software and used for testing various algorithms 
[9]. The image covers the city of Atlanta, USA and its surrounding. Satimage2 is also a Landsat-TM 
image depicting outskirts of the city of Vienna, Austria [3] . 

The Sat-imagel is of size 512 x 512 pixels captured by seven detectors operating in different spectral 
bands from Landsat-TM3. Each of the detectors generates an image with pixel values varying from to 
255. The 512x512 ground truth data provide the actual distribution of classes of objects captured in the 
image. From this data we produce the labeled data set with each pixel represented by a 7-dimensional 
feature vector and a class label. Each dimension of a feature vector comes from one channel and the 
class label comes from the ground truth data. Figure 1(a) shows the ground truth for Satimagel where 
different classes are indicated with different shades of grey. 

Satimage2 also is a seven channel Landsat-TM image of size 512 x 512. However, due to some character- 
istic of the hardware used in capturing the images the first row and the last column of the images contain 
gray value 0. So we did not include those pixels in our study and effectively worked with 511 x 511 
images. The ground truth containing four classes is used for labeling the data. Figure 1(c) shows the 
ground truth for Satimage2 where different classes are indicated with different shades of grey. 

In our study we generated 4 sets of training samples for each of the images. For Satimagel each training 
set contains 200 data points randomly chosen from each class. For Satimage2 we include in each training 
set 800 randomly chosen data points from each of the four classes. The classifiers designed with the 
training data are tested on the whole images. 

The classification results of the non-contextual fuzzy rulebascd systems arc summarized in the parts (a) 
and (b) of Table 1 for Satimagel and Satimage2 respectively. Parts (a) and (b) of Table 2 summarize 
the performances of the four classifiers using different methods of aggregation of spatial information for 
Satimagel and Satimage2 respectively. We used the same fuzzy rulebases for respective training sets as 
used previously, but the decision making step is modified to use the proposed four methods. 

Comparison between Table 1(a) and Table 2(a) shows that for Satimagel Method 3 performs the best 
with improvements varied between 1.12% and 2.05% and the best performing classifier (for training set 
4) achieves an error rate as low as 10.45%. This is closely followed by Method 2. Methods 1 and 4 
show marginal improvement for training sets 1 and 2 while for training sets 3 and 4 their performance 
degrades a little. Comparison of Tables 1(b) and 2(b) shows that although the classifiers using Method 3 
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Tabic 1: Performances of non-contextual fuzzy rulebased classifiers 



Trng. 
Set 


No. of 
rules 




Error Rate in 
Training Data 


Error Rate in 
Whole Image 


(a) Satimagcl 


1. 


30 


5.0 


12.0% 


13.6% 


2. 


25 


6.0 


14.3% 


14.47% 


3. 


25 


5.0 


12.0% 


13.03% 


4. 


27 


4.0 


12.6% 


12.5% 


(b) Satimage2 


1. 


14 


2.0 


16.3% 


14.14% 


2. 


14 


2.0 


16.3% 


14.04% 


3. 


12 


2.0 


17.09% 


14.01% 


4. 


11 


2.0 


17.34% 


14.23% 



Table 2: Performances of fuzzy rulebased contextual classifiers 



Trng. 
Set 


Error Rate in Whole Image 


Method 1 | Method 2 | Method 3 | Method 4 


(a) Satimagel 


1. 


13.32% 


11.93% 


11.48% 


13.43% 


2. 


14.00% 


13.01% 


12.62% 


14.38% 


3. 


13.66% 


11.54% 


11.23% 


13.90% 


4. 


12.98% 


11.01% 


10.45% 


13.18% 


(b) Satimage2 


1. 


11.55% 


12.96% 


12.64% 


11.75% 


2. 


11.55% 


12.75% 


12.48% 


11.65% 


3. 


11.24% 


12.45% 


12.23% 


11.35% 


4. 


11.24% 


12.53% 


12.28% 


11.44% 




Figure 1: (a) The ground truth for Satimagel. (b) The classified image for Satimagel (Method 3, 
Training set 1). (c)The ground truth for Satimage2. (d) The classified image for Satimage2 (Method 4, 
Training set 1). 
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Table 3: Class- wise average classification performance for the four proposed methods 1-4 



Landcover 

types (Frequency) 


Average classification rate (%) 


1 | 2 | 3 | 4 


(a) Satimagel 


Forest (176987) 


93.25 


90.80 


91.33 


94.47 


Water (23070) 


90.95 


90.42 


90.64 


92.29 


Agriculture (26986) 


68.49 


83.60 


83.12 


60.12 


Bare ground (740) 


58.15 


65.46 


65.92 


55.10 


Grass (12518) 


53.19 


73.75 


75.65 


55.59 


Urban area (11636) 


80.82 


80.83 


79.76 


75.32 


Shadow (3197) 


48.97 


96.09 


95.68 


40.42 


Clouds (358) 


99.78 


99.78 


99.78 


99.79 


(b) Satimage2 


Built-up land (48502) 


85.83 


84.09 


84.38 


85.55 


Forest (161522) 


95.46 


94.00 


94.22 


95.68 


Water (2643) 


96.02 


93.77 


94.32 


96.13 


Agriculture (48554) 


68.20 


68.03 


67.79 


66.92 



increase the classification accuracy by about 1.5%, Method 1 is clearly the best performer for Satimage2 
with improvements ranging between 2.49% and 2.99%. Similar performances are achieved by Method 
4. Method 4 used w = 1 for both Table 2(a) and Table 2(b). Figure 1(b) shows the classified image 
corresponding to training set 1 and Method 3 for Satimagel. Figure 1(d) displays classified image (for 
training set 1, Method 4) for Satimage2. 

Our experimental results demonstrate that for Satimagel Methods 3 and 2 work well while the other 
methods work better for Satimage2. These differences of performances can be explained if we look 
into the way the neighborhood information is aggregated in each method and the nature of the spatial 
distribution of the classes in the images. In Method 1 the fuzzy label vectors of the central pixel (the 
pixel of interest) and its eight neighboring pixels are treated in same way for aggregation. The same is 
true for Method 4 though evidence theoretic approach is used for information aggregation. In Methods 

2 and 3 eight BPAs are defined, each of them is assigned using the possibilistic label vector of the 
central pixel and that of one of the eight neighboring pixels. Thus all the BPAs are heavily influenced 
by the central pixel, and consequently, so is the final decision. Hence, it is expected that for the images 
dominated by large stretches of homogeneous areas (i.e., area covered by single land cover type) can be 
classified better by Methods 1 and 4. Also for Method 4 there may exist an optimal weight (different 
from 1) with improved performance. We shall see later that this is indeed the case. Comparing Figure 
1(c) (ground truth for Satimage2) with Figure 1(a) (ground truth for Satimagel) reveals that Satimagel 
mostly consists of small (often very small) patches of land cover types, while Satimage2 has large patches 
of land cover types. So for Satimagel neighborhood information needs to be used judiciously to get an 
improved classifier. This is what achieved by Methods 2 and 3. 

To have a closer look at class-wise performances of the proposed methods we have presented in Table 

3 the class-wise classification performances. It can be seen from the table that for Satimage2 all four 
methods perform comparably for each of the four classes, with a slender edge in favor of Methods 1 
and 4. However, for Satimagel, which contains more complicated spatial distribution of the classes, 
there is significant class-specific variation of performance among different methods. For the classes 
Forest, Water, Urban area and Clouds all methods perform nearly equally (the variation is within 5% 
approximately) well, however for other classes the performances differ significantly. For Agriculture, the 
second largest class, Methods 2 and 3 have accuracy of over 83%, while Methods 1 and 4 are 68.49% 
and 60.12% accurate respectively. For the class Bare ground, Methods 2 and 3 outperform the others 
comfortably. For the class Shadow, there is a huge performance gap between the Methods 2-3 (« 96%) 
and the Methods 1 and 4 (49% and 40%). However, since the frequency of the Shadow class is very 
small, this variation does not affect the overall accuracy significantly. 

It can be observed from the above tables that in a few cases classifiers with fewer number of rules 
perform better than those with larger number of rules. This is due to the fact that each classifier is 
trained with different randomly generated training sets and there is some randomness involved in the 



10 




Figure 2: The result of grid search for optimal value of w using a 100 x 100 sub-images of (a) Satimagel 
and (b) Satimage2. The numbers beside the plots correspond to the rulebase no. 

Table 4: Classification performances of classifiers using modified method 4 with w — 0.35 on Satimagel 



Training Set 


1 


2 


3 


4 


Error rate 


11.10% 


12.24% 


11.31% 


10.33% 



SOFM based prototype generation scheme. These result in different rulebases, where one with more 
number of rules may have a few rules in partial conflict. 

The experiments are carried out using a desktop computer with P4 processor (3.0 GHz) and 256 MB 
memory. Apart from the image sizes, the computation time depends largely on the number of rules and 
number of classes. For an image the computation time can be divided into two parts, computation of 
the fuzzy label vectors using rulebase and decision making by applying the evidence-theoretic methods. 
The first part is the same across all 4 methods and requires approximately 50 seconds for Satimagel 
(25-30 rules and 8 classes) and 25 seconds for Satimage2 (11-14 rules and 4 classes). The decision making 
using Methods 1-4 require 2, 10, 50 and 5 seconds respectively for Satimagel and 1, 5, 10, and 2 seconds 
respectively for Satimage2. As expected, Method 1 is the fastest one because it involves simple averaging 
of the label vectors. On the other hand, Method 3 requires dealing with both singleton and doubleton 
sets of classes that increase the computation time almost quadratically with number of classes. 

To investigate the possibility of optimization of Method 4, now we use the modified Equations (12)- 
(16) that incorporate a weight factor w for controlling the contribution of the information from the 
neighborhood in the decision making process. The value w = 1.0 makes the method same as the original 
Method 4. To find an optimal value of w we use a 100 x 100 sub-image from each image and find an 
optimal w based on these blocks. We use grid search to find the optimal w. Note that, the rulebases 
arc the same as used earlier. For example, for Satimagel, for each of the four rule sets we find the 
optimal w using the classification error on the selected block of image. Figure 2(a) depicts the variation 
of classification error as a function of w for Satimagel. It is interesting to observe that for Satimagel for 
all four rule bases the best performances are achieved around w — 0.35. So we should use the modified 
Method 4 with w — 0.35 for each of the four rulebases and Table 4 displays the classification errors for 
the whole image. Comparing Table 4 with column 5 of Table 2(b), we find a consistent improvement 
with w = 0.35 in all four cases. The improvement varied between 2.14% and 2.75%. We also tried to 
find an optimal w for each of the four rulebases for Satimage2. Figure 2(b) displays the classification 
error as a function of w. In this case, for all four rulebases we find an optimal value of w = 1.0. This 
again confirms the fact that when different land cover types form spatially compact clusters, neighbor 
and central pixels play equally important roles in decision making. 

4.1 Comparison with an MRF model-based classifier 

Recently contextual methods based on Markov random field (MRF) models have become popular for 
classification of multispectral [20] as well as multisource [21], [22] satellite images. Typically a Bayesian 
framework is used to model the posterior probability. To estimate the model parameters an energy 
function is minimized using an optimization technique like simulated annealing, genetic algorithm etc. 
These methods usually utilize the contextual information in the training stage. Also their accuracy 
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Table 5: Comparison of classification performance (percentage errors) of MRF-based method with the 
proposed methods 



Image 


MRF 


Fuzzy 


Proposed contextual methods 




method 


rulebasc 


1 


2 


3 


4 


Satimagc 1 


18.79 


9.86 


5.96 


8.58 


8.18 


5.81 


Satimagc 2 


17.81 


16.41 


12.55 


14.68 


14.47 


12.45 



depends on the correctness of assumption about the class-conditional density functions. Here we compare 
the proposed methods with the MRF model-based approach introduced by Sarkar et al. [20]. This 
approach is based on capturing intrinsic characters of tonal and textural regions of a multispectral 
image. Using an initially oversegmcnted image and the multispectral image it defines an MRF over 
the region adjacency graph (RAG) of the initial segmented image. Because of the energy minimization 
of the associated MRF, the oversegmented regions are likely to be merged. To compare the similarity 
of adjacent regions, a multivariate statistical test is incorporated while minimizing the energy function 
associated with the underlying MRF. A cluster validation technique is also used for finding optimal 
segments. 

The implementation of the MRF-based scheme used for the experiments here is capable of handling 
only up to four channels. Therefore, we used channels 3, 4, 5, and 7 for Satimagel and channels 2, 
3, 5 and 7 for Satimage2. The choice is based on visual inspection of the channel images as well as 
some experiments with various combinations of channels with a view to finding a good combination of 
features suitable for the MRF-based method. Using the same training-test data sets as used by the 
previous experiments, for Satimagel the MRF-based method fails to discriminate some of the classes, 
for example, a significant part of the water becomes forest. So we used a more detailed 10-class ground 
truth data. Unfortunately this detailed ground truth is incomplete, i.e., does not cover the entire image. 
Therefore, we used a training set with 1463 data points and a test set with 1480 data points. For 
Satimage2 the training set consisted of 3200 points as earlier. The classification performance of the 
MRF-based method, non-contextual fuzzy rulebased method and proposed contextual decision making 
schemes on the test sets is summarized in Table 5. All experiments are conducted with the same training 
and test data sets for respective images as described in this section. 

Table 5 shows that for Satimagel the error rate of the ordinary (noncontextual) fuzzy rulebased classifier 
is almost half of that by the MRF-based method. All of the four proposed contextual decision making 
schemes reduce the error rate further with Method 4 achieving the lowest error rate of 5.81%. In case of 
Satimagc2 also the error rate for the noncontextual fuzzy rulebased classifier is about 2.5% less than the 
MRF based method. The proposed contextual schemes again produce further improved classification 
performance. Here again Method 4 exhibits the lowest error rate of 12.45% closely followed by Method 
1 (12.55%). Thus it can be observed that for both the images the ordinary fuzzy rulebased classifiers as 
well as the four contextual classifiers perform consistently better than the MRF-model based method 
tested here. 



5 Conclusion 

We proposed a comprehensive scheme for designing contextual classifiers for land cover classification in 
multispectral satellite images. This is a multi-stage scheme. First an SOFM-based dynamic prototype 
generation algorithm is used to generate an adequate number of prototypes. Then the prototypes are 
converted into fuzzy rules and they are further fine-tuned to design an efficient fuzzy rulebased classifier. 
However, this classifier is a non-contextual one. Since the landcover classes usually appear in spatial 
clusters, a classification scheme using contextual information can be more efficient than non-contextual 
one. Hence we develop four decision making schemes which use information from spatial neighborhood 
of the pixels. For assigning a class label to a pixel the schemes use the information generated by the fuzzy 
rulebase in the form of possibilistic label vectors for the pixel under consideration as well as its eight 
spatial neighbors. Three of the proposed schemes use evidence theoretic framework and the remaining 
one is based on the fuzzy k-NN rule. 

The suitability of a particular scheme depends to some extent on the nature of the image to be classified. 
The performance of the methods on the training / validation data can be used to decide on the best 
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classifier for a given situation. One of the methods has a parameter that can be tuned based on the 
training/validation data to get a classifier with improved performance. 
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